
Conversation

@JC-ut0
Contributor

@JC-ut0 JC-ut0 commented Nov 22, 2025

What this PR does / why we need it?

Add a Qwen3-235B tutorial, including the following examples (an illustrative command sketch follows the list):

  • Single-node Online Deployment for 128k context inference
  • Multi-node Deployment with MP
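For orientation, here is a minimal sketch of what the single-node serve command could look like, assembled from the flags quoted in the review threads below. The model path and the --tensor-parallel-size value are placeholders, not values taken from the tutorial.

# Minimal sketch, not the exact tutorial command: the model path and
# --tensor-parallel-size are assumed placeholders; the remaining flags are
# the ones quoted in the review comments on this PR.
vllm serve /path/to/Qwen3-235B \
  --tensor-parallel-size 16 \
  --quantization ascend \
  --served-model-name qwen3 \
  --max-num-seqs 4 \
  --max-model-len 133000 \
  --gpu-memory-utilization 0.95 \
  --rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \
  --additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
  --compilation-config '{"cudagraph_capture_sizes":[1,4],"cudagraph_mode":"FULL_DECODE_ONLY"}'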

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a new tutorial for running the Qwen3-235B model. The documentation is well-structured and provides good detail. I've found a couple of critical typos in model names within commands that would cause them to fail, and a potentially confusing or incorrect configuration for cudagraph_capture_sizes. I've left specific comments with suggestions to fix these issues.

@github-actions

github-actions bot commented Dec 2, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

--gpu-memory-utilization 0.95 \
--rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false}}' \
--compilation-config '{"cudagraph_capture_sizes":[1,4],"cudagraph_mode":"FULL_DECODE_ONLY"}' \
Collaborator

The example we provided represents the best practice under normal circumstances: optimal performance under stable operating conditions. Is this capture size value a bit too small?

Contributor Author

@JC-ut0 JC-ut0 commented Dec 4, 2025

The cudagraph_capture_sizes value is chosen to match --max-num-seqs 4. This is an optimal example for 128k-sequence inference.
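In other words, the capture sizes only need to cover decode batch sizes that can actually occur: with --max-num-seqs 4 the running batch never exceeds 4 sequences, so capturing graphs for sizes 1 and 4 should be enough (vLLM pads a batch up to the nearest captured size). A hedged illustration of the pairing, using only the values already quoted from the tutorial command:

--max-num-seqs 4 \
--compilation-config '{"cudagraph_capture_sizes":[1,4],"cudagraph_mode":"FULL_DECODE_ONLY"}' \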

--quantization ascend \
--served-model-name qwen3 \
--max-num-seqs 4 \
--max-model-len 133000 \
Collaborator

The value of max-model-len should be 131072. I tried to run this command but got the following error:

(APIServer pid=598)   File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 994, in create_model_config
(APIServer pid=598)     return ModelConfig(
(APIServer pid=598)            ^^^^^^^^^^^^
(APIServer pid=598)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=598)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=598) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=598)   Value error, User-specified max_model_len (133000) is greater than the derived max_model_len (max_position_embeddings=131072 or model_max_length=None in model's config.json). To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1. VLLM_ALLOW_LONG_MAX_MODEL_LEN must be used with extreme caution. If the model uses relative position encoding (RoPE), positions exceeding derived_max_model_len lead to nan. If the model uses absolute position encoding, positions exceeding derived_max_model_len will cause a CUDA array out-of-bounds error. [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]

Contributor Author

Did you add this parameter: --rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}'?
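For reference, the tutorial command pairs the extended context length with the YaRN rope-scaling override, which is the combination the author is pointing to; whether --max-model-len 133000 still exceeds the derived limit on a given vLLM version is worth verifying, and the error above also names VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 as an explicit (use-with-caution) escape hatch. A minimal fragment showing the pairing, taken verbatim from the flags quoted earlier in this conversation:

--max-model-len 133000 \
--rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \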

@JC-ut0 JC-ut0 changed the title from "Add Qwen3-235B tutorial" to "[Doc] Add Qwen3-235B tutorial" on Dec 4, 2025